13 Jul 2020


Background

Open Access Coronavirus Disease Epidemiological Data

Johns Hopkins University

The Center for Systems Science and Engineering (CSSE) at Johns Hopkins University provides a public, global COVID-19 GitHub repository (https://github.com/CSSEGISandData/COVID-19) with anonymized patient data aggregated from a number of sources.

We have built a centralised repository of individual-level information on patients with laboratory-confirmed COVID-19 (in China, confirmed by detection of virus nucleic acid at the City and Provincial Centers for Disease Control and Prevention), including their travel history, location (highest resolution available and corresponding latitude and longitude), symptoms, and reported onset dates, as well as confirmation dates and basic demographics. Information is collated from a variety of sources, including official reports from WHO, Ministries of Health, and Chinese local, provincial, and national health authorities. If additional data are available from reliable online reports, they are included. Data are available openly and are updated on a regular basis (around twice a day).

CSSE Data Sources (partial list):

The CSSE data are used for all global analyses in this document.

The New York Times

The New York Times also publishes COVID-19 case and death data for the United States by state and by county. The U.S. data used for this analysis are pulled directly from The New York Times COVID-19 GitHub repository (https://github.com/nytimes/covid-19-data), which describes the data as follows:

The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.

Since late January, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.

We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.

The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.


Data Analysis

The COVID-19 data in the New York Times GitHub repository are structured as three main comma-separated value (CSV) files: a national summary file, a state-level summary file, and a file containing reported case and death data for each individual U.S. county. All three are used in this analysis. The data from each file are used to calculate the rate of reported new cases and deaths for each state and county, and for each entity these rates are fit with a linear regression model by the least-squares method. A risk estimate (ρ) is generated from these models, and the states and counties with the highest estimated risk are compared in the charts shown in this document. In the charts showing new reported cases and deaths, a generalized additive model (GAM) smoothing function is fit to each data set.
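The first two steps described above (turning cumulative counts into daily new counts, then fitting a least-squares line) can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the analysis; the function names and the sample counts are hypothetical, and the definition of the risk estimate ρ is not given in this document, so the sketch stops at the fitted slope.

```python
def daily_new(cumulative):
    """Convert cumulative counts into daily new counts.

    The first day has no predecessor, so the result is one element shorter.
    """
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def least_squares_line(values):
    """Fit values against day index 0..n-1 by ordinary least squares.

    Returns (slope, intercept).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    days = range(n)
    mean_x = sum(days) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, values))
    sxx = sum((x - mean_x) ** 2 for x in days)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical cumulative case counts for one state or county (not real data).
cumulative_cases = [100, 110, 125, 145, 170, 200, 235]
new_cases = daily_new(cumulative_cases)
slope, intercept = least_squares_line(new_cases)
print(f"new cases per day are changing by about {slope:.1f} per day")
```

A positive slope means reported new cases are accelerating for that entity; how the analysis converts such a fit into ρ is not specified here.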

The risk assessment methodology used in this analysis has not been validated and is subject to noise in the data. White House press briefings on the COVID-19 response have noted that some counties wait until Monday to report the incremental changes that occurred over the weekend. Consistent with this, cyclical weekly variation can be seen in the reported case and death data, which limits the accuracy of the model to some extent. To make the risk estimate more robust to this variation, data from a several-day period are used, as a compromise between detecting a significant change in risk quickly and avoiding estimation error caused by high sensitivity to noise.
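The trade-off described above can be illustrated with a trailing-window average. In this sketch the window length of 7 days is an assumption (the text says only "several-day"), and the function name and counts are hypothetical. A window that spans a whole week contains each weekday exactly once, so a purely weekly reporting cycle cancels out, at the cost of reacting more slowly to a real change.

```python
def trailing_mean(values, window=7):
    """Mean over each full trailing window of the given length.

    Returns len(values) - window + 1 entries, one per day that has a
    complete window of history behind it.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[end - window + 1 : end + 1]) / window
        for end in range(window - 1, len(values))
    ]


# Hypothetical daily counts: no underlying trend, but the last two days of
# each week are under-reported (not real data).
one_week = [100, 100, 100, 100, 100, 60, 40]
daily = one_week * 3

smoothed_7 = trailing_mean(daily, window=7)  # weekly cycle cancels out
smoothed_3 = trailing_mean(daily, window=3)  # shorter window still oscillates
print(max(smoothed_7) - min(smoothed_7), max(smoothed_3) - min(smoothed_3))
```

A shorter window detects changes sooner but passes more of the weekly reporting artifact through to the estimate.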

The predictive analytics model is built with the open-source R programming language using the Tidyverse family of packages.




Summary Results

U.S.

There have been 3,318,347 total COVID-19 cases (57,789 new cases per day) and 134,976 deaths (395 new deaths per day) in the United States to date.







Comparison with the EU

The aggregated data from Johns Hopkins University CSSE were used to calculate a combined case rate for the 27 member states of the European Union (EU). The combined data were used to compare the course of the pandemic in the EU with that in the U.S. over time. The rise in infections in the EU preceded the rise in the U.S.: the 2,500th case in the EU was recorded on 02 Mar 2020, and the 2,500th case in the U.S. was recorded on 14 Mar 2020. This comparison is of limited use, however, because the populations of the two regions differ (U.S.: 328,239,523; EU: 447,206,135) and a number of other factors (e.g., population density, health care systems, prevalence of comorbidities) are not consistent between the two.
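Because the two populations differ, counts are easier to compare once they are scaled per capita. The sketch below, in Python, shows one way to do that and to find the date on which a cumulative series first reaches a threshold. The function names and the three-day sample series are hypothetical; the population figures and the U.S. case total are the ones quoted in this document.

```python
US_POPULATION = 328_239_523  # from the text above
EU_POPULATION = 447_206_135  # from the text above


def per_100k(count, population):
    """Scale a raw count to a rate per 100,000 residents."""
    return count / population * 100_000


def first_date_reaching(series, threshold):
    """Return the first date whose cumulative count is at least threshold.

    series is a chronologically ordered list of (date_label, cumulative_count).
    Returns None if the threshold is never reached.
    """
    for date_label, total in series:
        if total >= threshold:
            return date_label
    return None


# U.S. cumulative cases reported in the Summary Results section.
us_rate = per_100k(3_318_347, US_POPULATION)
print(f"U.S. cumulative cases per 100,000 residents: {us_rate:.0f}")

# Hypothetical cumulative series, for illustration only (not real data).
sample_series = [("day 1", 900), ("day 2", 2600), ("day 3", 4100)]
print(first_date_reaching(sample_series, 2500))
```

Per-capita scaling removes only the population difference; the other factors listed above still limit the comparison.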





Individual States






Counties






Community Mobility Data

To assist the global COVID-19 pandemic response, Google has made available detailed mobility estimates, expressed relative to local baselines, derived from mobile phone and other data of the type used by traffic and similar services such as Google Maps and Waze. Google provides the data in the form of Community Mobility Reports, which it describes as follows:

As global communities respond to COVID-19, we’ve heard from public health officials that the same type of aggregated, anonymized insights we use in products such as Google Maps could be helpful as they make critical decisions to combat COVID-19.

These Community Mobility Reports aim to provide insights into what has changed in response to policies aimed at combating COVID-19. The reports chart movement trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential.

The data used for the analysis below are current through 03 Jul 2020.
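The mobility values are percent changes from a baseline, not absolute visit counts. Google's documentation describes the baseline as the median value for the corresponding day of the week over the five weeks from 3 Jan to 6 Feb 2020. The Python sketch below illustrates that calculation so the charts are easier to read; the reports already contain the computed percentages, and the function names and numbers here are hypothetical.

```python
from statistics import median


def weekday_baselines(observations):
    """Median value per weekday over a baseline period.

    observations is a list of (weekday_index, value) pairs, where
    weekday_index runs from 0 (Monday) to 6 (Sunday).
    """
    by_weekday = {}
    for weekday, value in observations:
        by_weekday.setdefault(weekday, []).append(value)
    return {weekday: median(values) for weekday, values in by_weekday.items()}


def percent_change(value, baseline):
    """Percent change of value relative to baseline."""
    return (value - baseline) / baseline * 100


# Hypothetical baseline-period activity for Mondays and Tuesdays (not real data).
baseline_period = [(0, 100), (0, 120), (0, 110), (1, 90), (1, 95), (1, 100)]
baselines = weekday_baselines(baseline_period)

# A later Monday with activity of 88 sits about 20% below the Monday baseline.
change = percent_change(88, baselines[0])
print(f"{change:+.0f}% relative to baseline")
```

Comparing each day with the baseline for the same weekday keeps ordinary weekday and weekend differences from appearing as changes in mobility.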




U.S.


Note: The dotted grey line on each of the mobility charts represents the 13 Mar 2020 date on which the U.S. declared a National Emergency Concerning the Novel Coronavirus Disease (COVID-19) Outbreak.




Individual States







Data Abnormalities

Analysis of the U.S. death data reported by The New York Times reveals a repeating weekly pattern in which the counts reported on Sunday and Monday are consistently lower than those reported on the other days of the week. As mentioned in the Data Analysis description in the Background section, the risk estimation algorithm has been configured to reduce the effect of this variation on the statistical model.
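One simple way to check for such a pattern is to average the daily counts by day of the week. The Python sketch below assumes a list of daily new-death counts beginning on a known date; the function name and the two weeks of counts are hypothetical and are not taken from the New York Times data.

```python
from collections import defaultdict
from datetime import date, timedelta

WEEKDAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
]


def mean_by_weekday(start, daily_counts):
    """Average daily counts grouped by day of the week.

    start is the calendar date of the first count; later counts are assumed
    to follow on consecutive days with no gaps.
    """
    grouped = defaultdict(list)
    for offset, count in enumerate(daily_counts):
        day = start + timedelta(days=offset)
        grouped[WEEKDAY_NAMES[day.weekday()]].append(count)
    return {name: sum(counts) / len(counts) for name, counts in grouped.items()}


# Hypothetical daily new deaths for two weeks starting Monday 1 Jun 2020.
counts = [
    300, 800, 900, 850, 800, 600, 250,
    300, 820, 880, 860, 790, 620, 250,
]
means = mean_by_weekday(date(2020, 6, 1), counts)
for name in WEEKDAY_NAMES:
    print(f"{name:<9} {means[name]:7.1f}")
```

Weekday means that stay well below the others over many weeks point to a reporting artifact rather than a real change in deaths.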